Exploration and Initial Data Summaries, Partitioning
How might we transform our outcome? (Box-Cox?)
Building Candidate Prediction Models
Assessing coefficients with tidy()
Obtaining summaries of fit with glance()
Checking Regression Assumptions
Five Types of Residual Plots (3 are new today)
Assessing the candidates, in training and test samples
431 strategy: “most useful” model?
1. Split the data into a development (model training) sample of about 70-80% of the observations, and a holdout (model test) sample containing the remaining observations.
2. Develop candidate models using the development sample.
3. Assess the quality of fit for candidate models within the development sample.
4. Check adherence to regression assumptions in the development sample.
5. When you have candidates, assess them based on the accuracy of the predictions they make for the data held out (and thus not used in building the models).
6. Select a “final” model for use based on the evidence in steps 3, 4 and especially 5.
Today’s Packages
library(janitor)      # clean_names(), tabyl(), etc.
library(naniar)       # identifying/dealing with NA
library(patchwork)    # combining ggplot2 plots
library(broom)        # for tidy(), glance(), augment()
library(car)          # for boxCox
library(corrr)        # for correlation matrix
library(GGally)       # for scatterplot matrix
library(ggrepel)      # help with residual plots
library(kableExtra)   # formatting tables
library(mosaic)       # for favstats and df_stats
library(sessioninfo)  # for session_info()
library(simputation)  # for single imputation
library(tidyverse)

theme_set(theme_bw())
options(tidyverse.quiet = TRUE)
options(dplyr.summarise.inform = FALSE)
Today’s Data
The dm1.Rds data contains four important variables + Subject ID on 500 adults with diabetes.
We want to predict the subject’s current Hemoglobin A1c level (a1c), using (up to) three predictors:
a1c_old: subject’s Hemoglobin A1c (in %) two years ago
age: subject’s age in years (between 30 and 70)
income: median income of subject’s neighborhood (3 levels)
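The import step isn’t shown on the slides; a minimal sketch, assuming dm1.Rds sits in the working directory:

```r
dm1 <- readRDS("dm1.Rds")  # file location assumed; adjust to your project layout
```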
What roles will these variables play?
a1c is our outcome, which we’ll predict using three models …
Model 1: Use a1c_old alone to predict a1c
Model 2: Use a1c_old and age together to predict a1c
Model 3: Use a1c_old, age, and income together to predict a1c
income n percent
Higher_than_50K 120 25.1%
Between_30-50K 188 39.2%
Below_30K 171 35.7%
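The income table above is the kind of output produced by janitor’s tabyl(); the exact call wasn’t shown on the slide, but a sketch (assuming the complete-case tibble is named dm1_cc) would be:

```r
# Counts and formatted percentages for the three income levels
dm1_cc |>
  tabyl(income) |>
  adorn_pct_formatting(digits = 1)
```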
Three candidate models for a1c
Our goal is accurate prediction of a1c values. Suppose we have decided to consider these three possible models…
Model 1: Use a1c_old alone to predict a1c
Model 2: Use a1c_old and age together to predict a1c
Model 3: Use a1c_old, age, and income together to predict a1c
How shall we be guided by our data?
It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. (A. Einstein)
Often, this is reduced to “make everything as simple as possible but no simpler”
How shall we be guided by our data?
Entities should not be multiplied without necessity. (Occam’s razor)
Often, this is reduced to “the simplest solution is most likely the right one”
George Box’s aphorisms
On Parsimony: Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
George Box’s aphorisms
On Worrying Selectively: Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.
and, the most familiar version…
… all models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind.
Partition the data: Training and Test Samples
Partitioning the 479 Complete Cases
Select a random sample (without replacement) of 70% of dm1_cc (60-80% is common) for model training.
Hold out the other 30% for model testing, using anti_join() to pull subjects not in dm1_cc_train.
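A sketch of that partition, using a simulated stand-in for dm1_cc since the real tibble isn’t shown here; slice_sample() draws without replacement by default, and anti_join() keeps the rows of dm1_cc with no matching subject in the training set:

```r
library(dplyr)

set.seed(20431)  # seed chosen here only for illustration

# simulated stand-in for the 479 complete cases
dm1_cc <- tibble(subject = paste0("S-", sprintf("%03d", 1:479)),
                 a1c = rnorm(479, mean = 8, sd = 1.5))

dm1_cc_train <- dm1_cc |> slice_sample(prop = 0.70)              # ~70% for training
dm1_cc_test  <- anti_join(dm1_cc, dm1_cc_train, by = "subject")  # the remainder

nrow(dm1_cc_train)  # training sample size (335 in the slides' split of 479)
nrow(dm1_cc_test)   # test sample size
```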
“Mutating Joins” join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
left_join(x, y): Join matching values from y to x.
right_join(x, y): Join matching values from x to y.
inner_join(x, y): Join data. Retain only rows with matches.
full_join(x, y): Join data. Retain all values, all rows.
We want to try to identify a good transformation for the conditional distribution of the outcome, given the predictors, in an attempt to make the linear regression assumptions of linearity, Normality and constant variance more appropriate.
(partial) Ladder of Power Transformations
| Transformation | \(y^2\) | \(y\) | \(\sqrt{y}\) | \(\log(y)\) | \(1/y\) | \(1/y^2\) |
|---|---|---|---|---|---|---|
| \(\lambda\) | 2 | 1 | 0.5 | 0 | -1 | -2 |
Consider a log transformation?
p1 <- ggplot(dm1_cc_train, aes(x = log(a1c))) +
  geom_histogram(bins = 15, fill = "royalblue", col = "white")

p2 <- ggplot(dm1_cc_train, aes(sample = log(a1c))) +
  geom_qq(col = "royalblue") +
  geom_qq_line(col = "magenta") +
  labs(y = "Observed log(a1c)", x = "Normal (0,1) quantiles") +
  theme(aspect.ratio = 1)

p3 <- ggplot(dm1_cc_train, aes(x = "", y = log(a1c))) +
  geom_violin(fill = "royalblue", alpha = 0.1) +
  geom_boxplot(fill = "royalblue", width = 0.3, notch = TRUE,
               outlier.color = "royalblue", outlier.size = 3) +
  labs(x = "", y = "Natural log of Hemoglobin A1c") +
  coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) +
  plot_annotation(
    title = "Natural Logarithm of Hemoglobin A1c",
    subtitle = str_glue("Model Development Sample: ",
                        nrow(dm1_cc_train), " adults with diabetes"))
Consider a log transformation?
Box-Cox to get started?
mod_0 <- lm(a1c ~ a1c_old + age + income, data = dm1_cc_train)
boxCox(mod_0)  ## from car package
Could Box-Cox be helpful?
summary(powerTransform(mod_0)) ## also from car package
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -1.0788 -1 -1.4632 -0.6944
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 31.24326 1 2.2764e-08
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 118.0104 1 < 2.22e-16
Consider the inverse?
p1 <- ggplot(dm1_cc_train, aes(x = (1/a1c))) +
  geom_histogram(bins = 15, fill = "forestgreen", col = "white")

p2 <- ggplot(dm1_cc_train, aes(sample = (1/a1c))) +
  geom_qq(col = "forestgreen") +
  geom_qq_line(col = "tomato") +
  labs(y = "Observed 1/a1c", x = "Normal (0,1) quantiles") +
  theme(aspect.ratio = 1)

p3 <- ggplot(dm1_cc_train, aes(x = "", y = (1/a1c))) +
  geom_violin(fill = "forestgreen", alpha = 0.1) +
  geom_boxplot(fill = "forestgreen", width = 0.3, notch = TRUE,
               outlier.color = "forestgreen", outlier.size = 3) +
  labs(x = "", y = "1/Hemoglobin A1c") +
  coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) +
  plot_annotation(
    title = "Inverse of Hemoglobin A1c",
    subtitle = str_glue("Model Development Sample: ",
                        nrow(dm1_cc_train), " adults with diabetes"))
Non-numeric variables removed from input: `income`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 3 × 4
term a1c_old age inv_a1c
<chr> <dbl> <dbl> <dbl>
1 a1c_old NA -0.167 -0.607
2 age -0.167 NA 0.155
3 inv_a1c -0.607 0.155 NA
Scatterplot Matrix
I select the outcome last. Then, the bottom row will show the most important scatterplots, with the outcome on the Y axis and each predictor, in turn, on the X axis.
ggpairs() comes from the GGally package.
temp <- dm1_cc_train |>
  mutate(inv_a1c = 1/a1c) |>
  select(a1c_old, age, income, inv_a1c)

ggpairs(temp, title = "Scatterplots: Model Development Sample",
        lower = list(combo = wrap("facethist", bins = 10)))
Scatterplot Matrix
Three Regression Models We’ll Fit
We continue to use the model training sample, and work with the (1/a1c) transformation.
mod_1 <- lm((1/a1c) ~ a1c_old, data = dm1_cc_train)
mod_2 <- lm((1/a1c) ~ a1c_old + age, data = dm1_cc_train)
mod_3 <- lm((1/a1c) ~ a1c_old + age + income, data = dm1_cc_train)
Assess fit of candidate models in training sample.
.resid = residual (observed - fitted outcome) values; larger residuals (positive or negative) mean poorer fit
.std.resid = standardized residuals (residuals scaled to SD = 1, remember residual mean is already 0)
What does augment give us?
aug1 <- augment(mod_1, data = dm1_cc_train) |>
  mutate(inv_a1c = 1/a1c)  # add in our model's outcome
aug1 also includes:
.hat statistic = measures leverage (larger values of .hat indicate unusual combinations of predictor values)
.cooksd = Cook’s distance (or Cook’s d), a measure of the subject’s influence on the model (larger Cook’s d values indicate that removing the point will materially change the model’s coefficients)
plus .sigma = estimated \(\sigma\) if this point is dropped from the model
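Those columns make it easy to screen for unusual points; a sketch, with the caveat that the cutoffs below (three times the mean leverage, Cook’s d above 0.5) are common conventions rather than anything from these slides:

```r
# Flag high-leverage or high-influence rows of aug1
aug1 |>
  filter(.hat > 3 * mean(.hat) | .cooksd > 0.5) |>
  select(subject, a1c_old, a1c, .resid, .hat, .cooksd) |>
  arrange(desc(.cooksd))
```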
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
303 3.648534 0.00030617 0.10257
A studentized residual is just another way to standardize the residuals that has some useful properties here.
No indication that having a maximum absolute value of 3.65 in a sample of 335 studentized residuals is a major concern about the Normality assumption, given the Bonferroni p-value = 0.103.
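The Bonferroni-adjusted output above is what car’s outlierTest() reports; the call itself isn’t shown on the slide, but it was presumably applied to the training-sample model:

```r
outlierTest(mod_1)  # from car: tests the largest |studentized residual|
```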
Collinearity = correlated predictors
Remember that the scatterplot matrix didn’t suggest any strong correlations between our predictors.
Is collinearity a serious issue here?
vif(mod_3)
GVIF Df GVIF^(1/(2*Df))
a1c_old 1.040833 1 1.020212
age 1.053313 1 1.026310
income 1.042187 2 1.010384
(generalized) Variance Inflation Factor tells us something about how the standard errors of our coefficients are inflated as a result of correlation between predictors.
We tend to worry most about VIFs in this output that exceed 5.
What would we do if we had strong collinearity? Drop a predictor?
Predict the test sample using these models
mod1 prediction errors in test sample
The augment function in the broom package will create predictions within our new sample, but we want to back-transform these predictions so that they are on the original scale (a1c, rather than our transformed regression outcome 1/a1c). Since the inverse transformation is its own inverse, we back out of it by taking the reciprocal of the fitted values provided by augment, and then calculate residuals on the original scale, as follows…
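A sketch of that back-transformation for mod_1, assuming the holdout tibble is named dm1_cc_test; the column names fit_a1c and res_a1c are my own labels, not from the slides:

```r
library(broom)

test_m1 <- augment(mod_1, newdata = dm1_cc_test) |>
  mutate(fit_a1c = 1 / .fitted,      # invert the 1/a1c transformation
         res_a1c = a1c - fit_a1c)    # prediction error on the original a1c scale
```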
Given this tibble, including predictions and residuals from the three models on our test data, we can now:
Visualize the prediction errors from each model (see Appendix).
Summarize those errors across each model.
Identify the “worst fitting” subject for each model in the test sample.
Comparing Model Predictions
Calculate the mean absolute prediction error (MAPE), the square root of the mean squared prediction error (RMSPE), and the maximum and median absolute prediction error across the predictions made by each model. Let’s add the validated \(R^2\) values (the squared correlation of fitted and observed values), too.
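Assuming a tibble test_res holding, for one model, the observed a1c, back-transformed fits fit_a1c, and errors res_a1c (illustrative names, not from the slides), the summaries might be computed like this:

```r
test_res |>
  summarize(MAPE       = mean(abs(res_a1c)),
            RMSPE      = sqrt(mean(res_a1c^2)),
            max_error  = max(abs(res_a1c)),
            median_APE = median(abs(res_a1c)),
            valid_R2   = cor(a1c, fit_a1c)^2)  # squared correlation = validated R^2
```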
Model mod_2 has the smallest mean absolute prediction error (MAPE), the smallest root mean squared prediction error (RMSPE), the smallest maximum error, and the best (highest) validated \(R^2\) value.
Model mod_1 has the smallest median absolute prediction error.
Identify the largest errors
Identify the subject(s) where that maximum prediction error was made by each model, and the observed and model-fitted values of a1c in each case.
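slice_max() makes this easy; a sketch for one model, assuming a tibble test_res with subject IDs, observed a1c, fitted fit_a1c, and error res_a1c columns (illustrative names):

```r
# Subject with the largest absolute prediction error for this model
test_res |>
  slice_max(abs(res_a1c), n = 1) |>
  select(subject, a1c, fit_a1c, res_a1c)
```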
Excluding subject S-002, mod_2 still wins four of five summaries.
“Complete Case” Conclusions?
In-sample model predictions are about equally accurate for each of the three models. mod_2 looks better in terms of adjusted \(R^2\) and AIC, but mod_1 is better on BIC. There’s really not much to choose from there.
Residual plots look similarly reasonable for linearity, Normality and constant variance in all three models.
No substantial signs of collinearity.
In our test sample, mod_2 has the smallest MAPE (mean APE), RMSPE and maximum error, and the best validated \(R^2\), while mod_1 has the smallest median absolute prediction error. All three models are pretty comparable. Excluding a bad miss on one subject in the test sample doesn’t change these conclusions.
So, what should our “most useful” model be?
431 strategy: “most useful” model?
Repeating what I discussed at the start of class…
1. Split the data into a development (model training) sample of about 70-80% of the observations, and a holdout (model test) sample containing the remaining observations.
2. Develop candidate models using the development sample.
3. Assess the quality of fit for candidate models within the development sample.
4. Check adherence to regression assumptions in the development sample.
5. When you have candidates, assess them based on the accuracy of the predictions they make for the data held out (and thus not used in building the models).
6. Select a “final” model for use based on the evidence in steps 3, 4 and especially 5.
Three predictor candidates, so we could have used…
a1c_old alone (our mod_1)
age alone
income alone
a1c_old and age (our mod_2)
a1c_old and income
age and income
a1c_old, age and income (our mod_3)
Would Stepwise Regression Help?
We’ll try backwards elimination, where we let R’s step function start with the full model (mod_3) including all three predictors, and then remove the predictor whose removal causes the largest drop in AIC, until we reach a point where eliminating another predictor will not improve the AIC.
The smaller (more negative, here) the AIC, the better.
Stepwise Regression on mod_3
step(mod_3)
Would Stepwise Regression Help?
Start: AIC=-2524.07
(1/a1c) ~ a1c_old + age + income
Df Sum of Sq RSS AIC
- income 2 0.000325 0.17405 -2527.4
- age 1 0.000821 0.17455 -2524.5
<none> 0.17373 -2524.1
- a1c_old 1 0.095609 0.26934 -2379.2
Step: AIC=-2527.44
(1/a1c) ~ a1c_old + age
Df Sum of Sq RSS AIC
- age 1 0.000796 0.17485 -2527.9
<none> 0.17405 -2527.4
- a1c_old 1 0.096438 0.27049 -2381.8
Step: AIC=-2527.91
(1/a1c) ~ a1c_old
Df Sum of Sq RSS AIC
<none> 0.17485 -2527.9
- a1c_old 1 0.10226 0.27711 -2375.7
Stepwise regression lands on our mod_1, as it turns out.
There is a huge amount of evidence that variable selection causes severe problems in estimation and inference.
Stepwise regression is an especially bad choice.
Disappointingly, there really isn’t a great choice. The task itself just isn’t one we can do well in a uniform way across all of the different types of regression models we’ll build.
More on this in 432.
ggplot2 for residual plots?
Residuals vs. Fitted Values plots are straightforward, with the use of the augment function from the broom package.
We can also plot residuals against individual predictors, if we like.
Similarly, plots to assess the Normality of the residuals, like a Normal Q-Q plot, are straightforward, and can use either raw residuals or standardized residuals.
ggplot2 for residual plots?
The scale-location plot of the square root of the standardized residuals vs. the fitted values is also pretty straightforward.
The augment function can be used to obtain Cook’s distance, standardized residuals and leverage values, so we can mimic both the index plot (of Cook’s distance) as well as the residuals vs. leverage plot with Cook’s distance contours, if we like.
Demonstrations on the next few slides.
Residuals vs. Fitted Values: ggplot2
ggplot(aug1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_point(data = aug1 |> slice_max(abs(.resid), n = 5),
             col = "red", size = 2) +
  geom_text_repel(data = aug1 |> slice_max(abs(.resid), n = 5),
                  aes(label = subject), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Residuals vs. Fitted Values from mod_1",
       caption = "5 largest |residuals| highlighted in red.",
       x = "Fitted Value of (1/a1c)", y = "Residual") +
  theme(aspect.ratio = 1)
Residuals vs. Fitted Values: ggplot2
Standardized Residuals: ggplot2
p1 <- ggplot(aug1, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual from mod_1",
       x = "Standard Normal Quantiles") +
  theme(aspect.ratio = 1)

p2 <- ggplot(aug1, aes(y = .std.resid, x = "")) +
  geom_violin(fill = "ivory") +
  geom_boxplot(width = 0.3) +
  labs(title = "Box and Violin Plots",
       y = "Standardized Residual from mod_1",
       x = "mod_1")

p1 + p2 +
  plot_layout(widths = c(2, 1)) +
  plot_annotation(
    title = "Normality of Standardized Residuals from mod_1",
    caption = str_glue("n = ", nrow(aug1 |> select(.std.resid)),
                       " residual values are plotted here."))
Standardized Residuals: ggplot2
Scale-Location Plot via ggplot2
ggplot(aug1, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_point(data = aug1 |> slice_max(sqrt(abs(.std.resid)), n = 3),
             col = "red", size = 1) +
  geom_text_repel(data = aug1 |> slice_max(sqrt(abs(.std.resid)), n = 3),
                  aes(label = subject), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Scale-Location Plot for mod_1",
       caption = "3 largest |Standardized Residual| in red.",
       x = "Fitted Value of (1/a1c)",
       y = "Square Root of |Standardized Residual|") +
  theme(aspect.ratio = 1)